-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dataframe data format in native XGBoost. #9828
Conversation
4008935
to
fa48de4
Compare
@david-cortes Could you please help take a look into the R interface? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few minor comments. I guess I could add int64 support later.
For some reason, even if just creating DMatrix (simplest possible task), the np array created from arrow extension is incredibly inefficient. I have checked the resulting DMatrices have the same hash. In addition, they run in the exact same code path inside libxgboost. |
Got it, it's caused by meta info instead of the covariate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more comments. I'm curious in particular about how these types interplay with feature types.
} | ||
})) | ||
## as.data.frame somehow converts integer/logical into real. | ||
data <- as.data.frame(sapply(data, function(x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Does the result here need to be a
data.frame
? Maybe you could just calllapply
, as I think it would avoid one list copy. Note that, per the other comments I'm leaving here, if integer columns have any missing values, they might need to be coerced through e.g.as.numeric
. - Comment above says that it convers integers to real. Isn't the idea with the types above to handle them in their native type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as.data.frame
does the coercing for me.
Comment above says that it convers integers to real. Isn't the idea with the types above to handle them in their native type
Ideally, we would like to handle different types of columns independently without any coercing, and hence without any data copying. However, at the moment only cuDF input can be consumed in this way due to missing value handling. R uses sentinel values to indicate missing/NA, while XGBoost can't have more than one missing value indicator at the moment. As a result, a DF containing a float column and an integer column with NAs can confuse XGBoost what value it should eliminate. Is it NaN
or NA(int)
?
The cuDF uses arrow IPC format as its memory layout and exposes them as part of the API, missing values are represented by a bitmask, we can handle all the columns without any transformation (except for categorical encoding).
} else if (is.integer(x)) { | ||
"int" | ||
} else if (is.logical(x)) { | ||
"i" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question there: I see this is also being used in the pandas adapter.
In R, boolean (logical) types are represented as C int
, where possible values are FALSE (zero), NA (-INT_MAX), and TRUE (everything else), while python's bool type has only True and False.
I see you mention in a comment later that these get converted to numeric type, but the C++ code still checks for integer/logical-typed columns.
What would happen with these missing values encoded as -INT_MAX if the columns are supplied in their original types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested by the comments in C++, those C++ handling code is not used but is more or less a reminder that we should try to avoid data transformation in R. I think the previous reply might help with the -INT_MAX
part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove the code if it's hindering readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove the code if it's hindering readability
I actually was thinking something along the lines that using sapply
instead of lapply
+ unlist
would avoid one list copy operation. Haven't checked this hypothesis though. I don't think the code is unreadable or hard to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion, I removed the unlist
as suggested in #9828 (comment) .
* @brief Create a DMatrix from columnar data. (table) | ||
* | ||
* @param data See @ref XGBoosterPredictFromColumnar for details. | ||
* @param config See @ref XGDMatrixCreateFromDense for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I'm wondering here: if this config already conveys the information about whether a column has integer type, is it actually needed to make a distinction between q
and int
in feature_types
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The columns don't convey the information accurately since we need to do some transformations before passing them into XGBoost. For instance, if a column is integer with missing values, we have to use float with NaN as an approximate.
Other than the |
cc @david-cortes DF for label and base margin is still not yet supported. These are useful for multi-output/multi-label problems. But we can work on them later. |
- Implement a columnar adapter. - Refactor Python pandas handling code to avoid converting into numpy. - Add support in R for transforming columns.
50eda8e
to
7bf3ccf
Compare
Added the |
@david-cortes Could you please help take another look? |
@@ -58,19 +61,28 @@ xgb.DMatrix <- function( | |||
qid = NULL, | |||
label_lower_bound = NULL, | |||
label_upper_bound = NULL, | |||
feature_weights = NULL | |||
feature_weights = NULL, | |||
enable_categorical = FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo
, for example?
If so, how about renaming it to 'autodetect_categorical' or something along those lines? (both in the R and Python interfaces) Would also be ideal to describe a bit more of it in the docs (e.g. that it's only for data frames).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?
Correct. It's more or less a guard to prevent surprise since XGBoost didn't accept categorical data before, which might cause issues in silence if we suddenly accept it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong preference on the naming, we have an introductory document for cat data in the tutorials, feel free to add additional explanation.
LGTM. Left two small comments. |
This is not yet as efficient as we would like since we can't handle missing value indicators for each column. I will leave that as a future to-do instead.
Categorical data should work with R factor now, minus the uncertainty about how we should suggest people to use a consistent encoder.
related: #9810 .